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Abstract 

A problem of substantial interest is to systematically map variation in chromatin structure 
to gene expression regulation across conditions, environments, or differentiated cell types. We 
developed and applied a quantitative framework for determining the existence, strength, and 
type of relationship between high-resolution chromatin structure in terms of DNasel hypersen- 
sitivity (DHS) and genome-wide gene expression levels in 20 diverse human cell lines. We show 
that ^25% of genes show cell-type specific expression explained by alterations in chromatin 
structure. We find that distal regions of chromatin structure (e.g., ±200kb) capture more genes 
with this relationship than local regions (e.g., ±2.5kb), yet the local regions show a more pro- 
nounced effect. By exploiting variation across cell-types, we were capable of pinpointing the 
most likely hypersensitive sites related to cell-type specific expression, which we show have a 
range of contextual usages. This quantitative framework is likely applicable to other settings 
aimed at relating continuous genomic measurements to gene expression variation. 

Abbreviations: ARS, Angle Ratio Statistic; DHS, DNasel hypersensitivity; SI, Supplementary 
Information 

Note: The Supplementary Information may be found among the source files in arxiv_SI.pdf. 



1 



1 Introduction 



Humans, like all other multicellular organisms, possess a large number of distinct cell-types, each 
of which is specialized for a particular function within the body. Cells from a variety of tissue types 
exhibit different gene expression profiles relating to their function, where typically only a fraction 
of the genome is expressed. As all somatic cells share the same genome, specialization is in part 
achieved by physically sequestering regions containing non-essential genes into heterochromatin 
structures. Genes which are needed for the particular task of the cell-type display an accessi- 
ble chromatin structure allowing for the binding of transcription factors and other related DNA 
machinery and subsequent gene expression. 

To date, most studies have been limited to considering the chromatin accessibility surrounding 
the promoter region of genes, which is typically proximal (<10kb) to the transcription region in 
just one or very few cell-types or experimental conditions [53, 4, 43] . However, it is also of interest 
to understand how larger regions (3>10kb) of chromatin structure relate to a gene's expression 
variation across multiple cell types, disease states, or environmental conditions. Recently, several 
large-scale international collaborations have started to generate data that can be used for this 
purpose [38], although doing so requires new developments in computational methods [21, 22, 11]. 

To this end, we undertook a genome-wide investigation to characterize the relationship between 
variations in chromatin structure and gene expression levels across 20 diverse human cell lines (SI, 
Table SI). We utilized data on chromatin structure as ascertained through DNasel hypersensitivity 
(DHS) measured by next-generation deep sequencing technology, and gene expression data mea- 
sured by Affymetrix exon arrays. Replicated data on 10 cell lines were also utilized to assess the 
robustness of our method. 

Relating DHS to gene expression levels across multiple cell-types is challenging because the 
DHS represents a continuous variable along the genome not bound to any specific region, and the 
relationship between DHS and gene expression is largely uncharacterized. In order to exploit varia- 
tion across cell-types and test for cell-type specific relationships between DHS and gene expression, 
the measurement units must be placed on a common scale, the continuous DHS measure associated 
to each gene in a well-defined manner, and all measurements considered simultaneously. Moreover, 
the chromatin and gene expression relationship may only manifest in a single cell-type, making 
standard measures of correlation between the two uninformative because their relationship is not 
linear over a continuous range, as shown in Fig. 1 (further details in SI and Figs. S2-S6). 

The computational approach developed here provides a powerful, tractable, and intuitive way 
of representing these data and capturing biologically informative relationships. We were able to 
characterize the level to which variation of chromatin accessibility is associated with gene expression 
variation in a cell- type specific manner. Within genomic segments of significant chromatin gene 
expression concordance, our methodology is further capable of pinpointing the most likely local sites 
related to the detected association. We show that such sites are context specific and can be shared 
across genes within a single cell-type or across several cell-types. Our quantitative framework has 
some generality in that it may be readily applied to associate any quantitative measure along the 
genome to gene expression variation. 
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Fi gUrG 1 ! Overview of Data and Proposed Approach. (A) Gene expression measurements for twenty cell lines on an 
example gene, HNFJ^A. (B) DNase-I Hypersensitivity (DHS) fragment sequencing counts in a region about the gene. (C) The 
DHS signal is captured by summing the overall number of fragments over a given segment size (e.g., ± lOOkb) about the gene's 
transcriptional start sight (TSS) to obtain a "DHS volume". After global normalization, the gene expression data and DHS 
volume measures are scaled to lie on the unit interval [0,1] and the data are centered about the origin according to the two- 
dimensional medoid. For the HNF^A example, three outliers are clearly visible; for example, HepG2 displays both chromatin 
accessibility and active gene expression, whereas Hela displays only chromatin accessibilty. The goal is to quantitatively capture 
the isolated relationship seen in HepG2 and assess whether this relationship is statistically significant. Traditional measures of 
linear correlation are not suitable for identifying this type of signal, as shown by the substantial change seen after removal of 
a single cell line, Hela, even though the data for Hela are expected to exist for many genes and cell lines. The proposed ARS 
is robust to Hela since the measure is based on angular placement and the median distance to the medoid of the data (dashed 
circle). (D) The ARS statistic is calculating by first quantifying the relative distance to the origin for each cell line in a robust 
manner. An angular penalty for each cell line is then calculated to quantify cell-types concordant in both expression and DHS 
measured. This quantity is measured in terms of angular distance from the 45° degree line, and it is then multiplied times its 
respective relative distance to give and overall score for each cefl line. The maximum score is taken as the statistic for the given 
gene, allowing a comparison across all genes. (E) A local version of the ARS statistic we introduce can pinpoint DHS "peaks" 
contributing the most to the detected association. See main text for details on the proposed methods. 



2 Results 



2.1 Genome- wide profiling of chromatin accessibility and gene expression 

We utilized data on genome-wide, high-resolution chromatin accessibility measurements for 20 
distinct human primary and culture cell lines that were obtained by an established sequencing-based 
method [34]. In principle, accessible "open" chromatin is cleaved by the non-specific endonuclease 
DNasel, and the cleaved fragments are sequenced to provide a high-resolution, genome-wide map of 
DNasel hypersensitivity (DHS) for every cell-type (SI, Table S2). The interpretation of these data is 
that increased fragment counts within a region are indicative of greater chromatin accessibility. To 
investigate the impact of regional chromatin accessibility on gene expression variation, we likewise 
utilized genome-wide gene expression measurements in each cell line from Affymetrix exon arrays 
(SI, Table S4). A total of 19,215 genes were analyzed after preprocessing (Methods). 

With these quantifications, we sought to characterize the relationship between chromatin ac- 
cessibility and gene expression in a cell-type specific manner, summarized in Fig. 1. To this end, 
the cell-type specific chromatin profiles were quantified by integrating the DHS fragment counts 
over increasingly larger genomic segments relative to the gene of interest (SI, Fig. S7) to obtain 
a cell-type specific regional DHS volume. We selected a range of segments that were likely to 
encompass all proximal (TSS ±2.5kb) and most distal regulatory elements (TSS ±50kb, ±100kb, 
±150kb, ±200kb, ±100kb minus proximal 2.5kb, and ±200kb minus proximal 2.5kb). Addition- 
ally, to account for copy number variation and chromosome arm related effects, the obtained DHS 
volumes were scaled on either side of the centromere to arrive at equilibrium across samples (SI, 
Fig. S8). Alternative representations of DHS signal [11, 6] could be utilized at this step, although 
we did not identify any advantages in doing so. Gene expression values were summarized as the 
mean intensity across all probe-sets linked to a given RefSeq-gene. 

2.2 Detecting cell-type specific chromatin accessibility and gene expression con- 
cordance 

Due to the "on-off" nature of DHS and subsequent transcription, there will not necessarily be a 
linear relationship between DHS and gene expression measures. Using correlation or correlation- 
like statistics to associate the two measurements across all cell-types proved to be unreliable and 
uninformative (further details in SI and Figs. S2-S6). One of the key types of relationships we 
sought to detect is of the type shown in Fig. 1, where one or very few cell types are outliers from 
the others. The standard Pearson correlation statistic is not well-suited for this scenario. First, 
it requires the data to be jointly Normal in order to obtain parametric p-values, but the Normal 
assumption does not hold for these data (SI, Fig. S3). Second, this correlation statistic is unstable 
when there are outliers, even when using permutation based p-values, demonstrated directly on 
these data (SI, Figs. S2 and S4). The rank-based Spearman correlation statistic is a potential 
alternative, but it shows very poor power relative to the method proposed here as shown in Fig. 2. 
(See also SI, Figs. S5 and S6). For example, at a false discovery rate (FDR) < 0.05, the proposed 
method identifies 2538 genes with a cell-type specific DHS and gene expression relationship whereas 
the Spearman statistic identifies only 286 (Figs. 2, S5, and S6). 

The new statistic proposed here is designed to be appropriate for scenarios when both mea- 
surements are restricted to a narrow relative range with one or very few cell-types appearing as 
distinct outliers. To evaluate the relationship between the DHS volume of a genomic segment and 
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gene expression, we took into account the compactness of the measurements versus any distinct 
outliers in both dimensions and whether the outliers were concordant in both measurements (i.e., 
a simultaneous increase or decrease) to form an overall composite measure called an Angle Ratio 
Statistic (ARS) (detailed in Fig. 1, Fig. SI, Methods, and SI). To summarize, we first scale and 
median center the DHS volume and expression data, respectively, for a given gene. We then calcu- 
late the relative distance of each cell type to the overall center of the data, which serves as a way 
to measure the degree to which each cell type is an outlier. In order to measure concordance of 
DHS volume and gene expression, we calculate the angular distance between each point and the 45° 
line of identity, penalizing points further away from the line of identity according to a data-derived 
exponential function. These two quantities are then multiplied to form an ARSj value for each 
cell type (i = 1,2,..., 20), and the maximal value ARS max is the overall statistic that quantifies 
cell-type specific DHS volume and gene expression concordance for a given gene. 

To identify statistically significant genes from ARS max , we constructed a null distribution based 
on randomization of the observed experimental data (see Methods, SI, and Fig. S9). ARS max values 
obtained from the randomized data were used as a basis for determining a p- value of the observed 
ARS max for each gene. False- discovery rate (FDR) based statistical significance and the proportion 
of genes with a true chromatin accessibility and gene expression relationship were estimated from 
the p- values [45] (Figure 2b). 

We estimate that ~25% of genes show concordance between chromatin accessibility and gene ex- 
pression variation in a cell-type specific manner. While our strategy is capable of detecting outliers 
showing negative concordance (decreased chromatin accessibility and decreased gene expression), 
none were found to be significant at FDR < 0.05. The number of significant genes increased by 
inclusion of distal DHS volume (Fig. 2b, column 2), indicating that distal chromatin programming 
effects are more widespread in a genome-wide sense. On the other hand, using the proximal DHS 
volume we observe a greater empirical effect size compared to the distal DHS volumes (Fig. 2b, 
columns 3-5). 

This observation is explained by the aggregation of genes significant for the same cell-type 
along the genome [44]. Testing whether one or more significant genes within a ±100kb region were 
associated with the same cell- type we found that 481 out of 668 significant genes within the specified 
boundary stem from the same cell-type (Fishers exact test p-value < 2.2e-16; SI and Fig. S15). It 
is however important to note that inclusion of increasingly distal regions also increases the noise in 
the DHS volume, wherefore the effect size and ultimately the number of true associations starts to 
decline (Fig. 2a). 

2.3 Experimental replication 

To assess reproducibility, we tested the concordance of significant results among replicated data for 
10 cell-types. Based on two independent measurements of DHS and gene expression, respectively, 
we calculated the fraction of predictions preserved in all four- way comparisons (SI). We found that 
between 86% to 91% of significant genes (FDR < 0.05) were identical (SI, Fig. S16). 

2.4 Gene ontology and pathway analysis 

To determine the biological coherence of the set of genes found to be significant for each cell-type, 
we performed a gene ontology (GO) enrichment analysis [10]. The method computes enrichment 
within the process and function components of GO categories and assigns a numerical significance 
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Figure 2: Statistical significance for ARS and correlation across genomic segments (A) Depicts the number of 
significant genes found at increasingly larger genomic segments for ARS and Spearman correlation, respectively (solid line is 
ARS and dashed line is Spearman correlation). (B) Statistical significance according to DHS volume segment size. Column 2 
shows the percentage of genes estimated to have concordant DHS volume and gene expression variation as captured by ARS max 
(1 — 7rn> as estimated in [45]). Columns 3-5 show the number of statistically significant genes at various FDR cut-offs. While the 
2.5Kb window shows more significant genes at the stringent FDR cut-offs, indicating a larger effect size, the overall percentage 
of genes showing a relationship is notably lower than the more distal DHS volumes. Compared to Spearman correlation, ARS is 
more powerful at detecting these associations (see SI for further details). (C) The relative ARS,: values across all cell-types for 
significant genes in the ±100kb region versus the analogous components for Spearman correlation (the cross product terms that 
sum to form the overall correlation). The ARSi values distinguish cell lines that have a strong DHS and expression concordance 
substantially more clearly than the Spearman correlation, showing that the traditional correlation is more likely to generate 
spurious results from small changes to the data. Enrichment gf biological functions for the significant genes found by either 
method corroborates this finding (see SI, Fig. S14) 



to the findings. In nearly all cases the results were in agreement with the actual biology; see 
http://encode.princeton.edu/ for results on all DHS segment sizes. For example, human T- 
cells showed a strong enrichment of T-cell receptor related genes, whereas hepatic cells showed 
enrichment of lipid metabolism related genes. KEGG pathways [24, 7] were likewise enriched in 
a cell-type specific manner. For example, HepG2 showed significant enrichment for genes within 
the bile acid synthesis and drug metabolism, while HL60 showed significant enrichment within the 
hematopoietic cell lineage (data not shown). 

Furthermore, all genes detected within each cell type at FDR < 0.05 (± lOOkb DHS volume) were 
analyzed through the use of Ingenuity Pathways Analysis (Ingenuity Systems®, www . ingenuity . com) 
For all but three cases out of 20 (two cell-types likely had too few significant genes detected to 
get reliable annotations), the category "Physiological System Development and Function" was in 
clear correspondence with that expected given the cell type, (SI, Fig. S14). For instance, TH1 was 
enriched for "cell- mediated immune response", K562 for "hematological system development and 
function" , and H7ESC for "embryonic development" . For each gene, there tended to be low relative 
ARSj across the remaining cell types, indicating that we detected truly cell- type specific genes as 
clear outliers on a genome-wide scale. However, some cases showed large relative ARSj in a few 
tissues, which prompted us to investigate these instances further. 

Among genes with a statistically significant ARS max statistic, additional inspection of the re- 
maining ARSj were explored for detection of possible sub-structures. We calculated relative ARS 
values within each gene dividing all ARSj by ARS max . In addition to many instances of singu- 
lar outliers, we detected a gradient behavior among significant genes, where a few cell-types were 
evident as outliers (Fig. SI 7). 

2.5 Local ARS Profiles 

The DHS data itself provides a rich source of information about regulatory elements in the genome. 
However, when used in conjunction with gene expression data across differing cell types, there is 
an opportunity to discover which locations of chromatin accessibility drive gene expression in a 
cell-type specific manner. This goal prompted us to develop a technique to model the relationship 
for fine-scale segments of DHS volume across the larger segments. As the above strategy focused 
on examination of chromatin gene expression interactions over genomic segments, investigation of 
fine-scale patterns allowed us to: (i) validate that distal regulatory regions were indeed present as 
peaks in chromatin accessibility concordant with gene expression in a cell-type specific manner, (ii) 
perform sequence analyses of these chromatin accessibility peaks, (iii) compare localized associations 
across cell-types or within a single gene, and (iv) provide a framework for quantifying regions of 
interest on a continuous scale for investigation of regulatory elements. 

We therefore extended our approach to allow one to identify and map DHS sites to genes 
on which they show strong evidence for playing a regulatory role in a cell-type specific manner. 
This was carried out by providing a fine-scale version of the ARS quantification, called a "local 
ARS profile" for genes with a statistically significant ARS max statistic over a larger segment. The 
peaks of the local ARS profiles pinpoint which DHS are most influential in explaining the cell-type 
specific gene expression variation, thereby indicating that they have the most regulatory potential. 
We retained the gene expression values for a given significant gene, and now considered the DHS 
volume within non-overlapping consecutive regions at a high resolution 60 base pair windows. The 
ARS statistic was calculated for each 60bp window, which can then be plotted over the entire 
region used in identifying the gene as statistical significant. For example, for a gene significant 
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with respect to a ±200kb DHS volume, we calculated ~6700 local ARS statistics for each cell type. 
These can then be plotted in such a way that the signal emanating from that location is visible, 
loosely analogous to a LOD score profile in linkage analysis. Additional steps were taken, involving 
scaling across the 60 bp windows to preserve a valid interpretation of their relative magnitudes 
(SI). 

We first selected the subset of local ARS profile "peaks" by thresholding the local ARS profiles in 
a principled manner (Methods) , and we analyzed both positional biases and sequence compositions 
as they relate to function. We then analyzed the entire trajectories of local ARS profiles at specific 
loci, showing that they identify both known and putative regulatory DHS for given genes. 

2.6 Positional bias of putative regulatory DHS 

Because the overall statistical significance increases when calculating DHS volume over more distal 
regions up to 200kb (Fig. 2), we investigated the positional bias of local ARS peaks in a cell- 
type specific manner. Figure 3a shows smoothed densities of positional local ARS peak counts 
by cell type, which exhibit high cell-type specific differences, specifically the density around the 
TSS. Random densities were generated by randomly assigning positional counts to tissues in equal 
proportions to the observed counts, where it can be seen that the cell-type differences are no longer 
present (SI, Fig. S18). This points to the existence of cell-type specific biases in the base-pair 
distance of regulatory DHS to TSS. 

2.7 Sequence analysis of peaks in local ARS profiles 

We next sought to characterize the functional significance of sequences corresponding to local ARS 
peaks. Since a general indicator of functionality is conservation, we extracted the conservation 
track values (phastCons44wayPrimate, hgl8) [41] corresponding to the local ARS peaks and to 
the negative control set (Methods). Values range from to 1, with 1 indicating the most con- 
served. The regions with local ARS peaks were significantly more conserved than regions from the 
negative control set (Kolmogorov-Smirnov p-value < 2.2e-16, SI, Fig. S19), indicating substantial 
conservation of sequences corresponding to local ARS peaks. 

DNase-I hypersensitive sites are well established markers of regulatory and other DNA binding 
proteins. We therefore sought to establish if known cell-type specific transcription factors binding 
sites (TFBSs) are over-represented in the local ARS peaks relative to the negative control set 
(Methods). Since regions distal to the TSS are rarely studied in this context, we eliminated all 
local ARS peaks and negative controls that fell within ±10 kb of the TSS. This step was taken to 
demonstrate that the proposed approach is capable of detecting distal TFBS, up to 200kb from the 
TSS. 

We utilized the JASPAR database [31] to identify TFBS that are differentially represented in the 
local ARS peaks relative to the negative control set (Methods). The over- and under-represented 
TFBS show distinct cell-type specific patterns and provide a rich insight into cell-type specific gene 
regulation (Fig. 3b), several of which are listed here: 

• Among the hepatocyte nuclear factors we found HNF1B (TCF-2) and HNF4A to have signif- 
icant chromatin gene expression concordance in HRCE and HepG2, respectively (SI, Fig. S20 
and Fig. S21). Furthermore we found the local ARS profiles in the respective tissues to dis- 
play a marked over-representation of the factor in question, HNF1B in HRCE and HNF4A 
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Figure 3: Analysis of local ARS profiles. (A) Distribution of local ARS peaks relative to the TSS according to 
cell type. The positional bias of cell-type specific local ARS peaks as measured by the density of local ARS peaks 
within cell lines with respect to position from to the TSS. Clear differences in the amount of distal regulation are 
seen across the cell-types and the density around the TSS differ markedly among cell types. For example, HL60 
shows a more proximal signal relative to that of HAEpiC. (B) Transcription factor binding site analysis among local 
ARS peaks occurring lOkb to 200kb from the TSS. Sequences corresponding to local ARS peaks within significant 
cell-type specific genes were searched with known transcription factor binding site models, and the relative over- 
and under-representation was assessed based on a negative control set. Instances of absolute log 2 fold-change > 2 
are displayed within the relevant cell types. Over-representation is indicative of a preferential transcription factor 
binding site, and is therefore a likely regulatory candidate for the observed gene expression. Under-represented sites 
indicate factors which should be avoided to maintain proper cell-type specific expression profiles. For instance, Sox2 
and Pou5fl (Oct4) were observed solely over-represented in the embryonic cells, H7ESC. 
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in HepG2. Mutations in HNF1B have been associated with a broad range of renal diseases 
[50], and HNF4A is essential for hepatocyte differentiation and morphology [29]. 

• H7ESC was found to show over-representation of SOX2 and POU5F1 (Oct-4) both essential 
for self-renewal in undifferentiated stem cells. 

• NFYA (a CCAAT-binding protein) was found over-represented in almost all tissues. This 
factor is essential for enhancer function by requiting distal transcription factors to the proxi- 
mal promoter region [52]. The ubiquitous CCAAT-binding factor family is linked to cellular 
differentiation in a variety of tissues [26]. 

• RXR-RAR was found in HAEpiC (human amniotic epithelial cells). The co-expression of 
the retinoic acid receptors (RARs) and the retinoid X receptors (RXRs) [37] are essential for 
proper placental development, and retinoid X receptor (RXR) null mouse mutants are lethal 
after 10 days due to placental defects [36]. 

• Forkhead binding sites were found to be primarily under-represented, specifically FOXD3 
was under-represented in, among others, the leukemic cell-types. Silencing of FOXD3 by 
aberrant chromatin modification has been implicated in leukemogenesis [37] . Over-expression 
of FOXD3 prevents neural crest formation [30] . Interestingly, binding sites for the factor were 
under-represented in SKNSH, a neuroblastoma derived from neural crest cells. 

• NF-kB was found over-represented in TH1, where it promotes the expression of, among others, 
interleukin 12 (IL-12) essential for TH1 development [28]. 

The differentially represented TFBSs were distributed largely distal. For all cell-types, from 68% 
to 79% were located more than ±50kb away from the TSS. We repeated the analysis with only 
the proximal regions (±10kb from the TSS), and we found that important known cell-type specific 
motifs were no longer detected (SI, Fig. S22). 

2.8 Mapping putative regulatory DHS to genes 

We also investigated the utility of considering the entire trajectory of local ARS profiles at a locus to 
characterize the regulatory architecture of cell-type specific expression. We investigated in detail 
two well-characterized examples of regulatory interactions at the /3-globin {HBB) locus control 
region and at the stem cell leukemia (SCL) gene, also known as TALI, with several more appearing 
in SI (Figs. S23-S35). It can be seen from these analyses that the local ARS profiles provide a 
means to map DHS sites to genes in a cell-type specific manner. 

The HBB (/3-globin) locus control region (LCR) comprises an array of functional elements that 
in vivo gives rise to five major DNase I hypersensitive sites (HS1-HS5 [49, 14, 19], Fig. 4) upstream 
of HBE1 (e-globin) on the short arm of chromosome 11. All five sites were present in cell line 
K562 according to our DHS data (see Figs. S23- S27 for complete data across all 20 cell-types, 
and Fig. S28 for local ARS profiles across all genes at this locus control region). Although the 
DHS volume at these sites contributed to both HBE1 and HBB yielding statistically significant 
ARS values, the relative importance of HS1-5 differs significantly between these two genes, clearly 
detected by the local ARS profiles (Fig. 4a). 

In the case of HBE1, we observed local ARS peaks for HS1 at -6.1kb and to a lesser extent 
HS3 and HS4 (-14.7 and -18kb respectively). For HBB we observed similar local ARS profiles for 
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Figure 4: Mapping putative regulatory DHS with local ARS profiles at two loci. (A) /3-globvn locus 
control region. The DHS data for five cell lines (out of 20) are shown, as well as the local ARS profiles for HBB 
and HBE1 in the K562 and HL60 cell lines. The transparent yellow boxes indicate regulatory regions, specifically 
hypersensitive regions 1-5 (HS1-5), together with a less characterized site upstream of HBD. It can be seen that HBB 
and HBE1 show different local ARS profiles, indicative of differences in usage of regulatory elements. The local ARS 
profile shows no peak in HL60 despite the existence of a hyper-sensitive site when considering DNasel profile alone. 
The full data and local ARS profiles for all 20 cell lines and both genes are displayed in Figs. S23-S27. (B) TALI 
locus. We identified TALI as statistically significant with its maximal ARS in the K562 cell line across all tested 
genomic segments. Local ARS profiles show a dominant effect from the +40 enhancer region (green box), spanning 
PDZK1IP1. DHS signals across multiple cell-types were correctly not detected to be associated with the expression 
of TALI. Furthermore, note that even though the DHS data for TALI and PDZK1IP1 are largely overlapping, they 
nevertheless have distinct local ARS profiles due to their different patterns of gene expression. This demonstrates that 
ARS is capable of separating interwoven signals across cell-types for neighboring genes, and that there is information 
to be gained by combining DHS and gene expression profiling. The full data for all 20 cell lines and local ARS profiles 
are displayed in Figs. S29-S31. 11 



HS1, HS3 and HS4, and smaller local ARS values for HS2 (-10.9kb). It has previously been shown 
that HS1 is a stable chromatin structure [14] through out development and essential for HBE1 
expression [40] due to a GATA-1 binding site, while HS2, 3 and 4 show synergistic enhancement 
of expression of HBB by formation of the LCR holocomplex [15, 27, 48]. Finally, the element 
upstream of HBD has also been reported to specifically enhance transcription of HBB [1]. While 
HS5 is present in the DHS data for K562, similar open chromatin structures were detected in other 
tissues. HS5 (-21kb) is not in concordance with tissue-specific gene expression of either HBE1 or 
HBB, an observation in line with this site's function as an insulator and CTCF binding site [47]. 

TALI encodes a basic helix-loop-helix protein which is essential for the formation of all hematopoi- 
etic lineages (SI, Figs. S29- S31 for all data across the 20 cell-types). Previous studies using chro- 
matin structure, comparative sequence analysis, transfection assays [13, 18], and transgenic mice 
[5, 17, 16, 35, 42] have identified a total of five enhancers modulating the expression of TALI. We 
detect TALI as significant with maximal cell line K562 across all tested genomic segments (from 
±2. 5KB to +200KB) with the most significant ARS max occurring for ±50kb. Further investigation 
by the local ARS profile (Fig. 4b) showed that while proximal regulatory sites were correctly iden- 
tified, the most dominant signal is by far confined to the +40 enhancer region and is an order of 
magnitude greater than other signals. While the TALI +40 region is downstream of PDZK1IP1, 
it was not linked to the expression of this gene which was detected as significant in HRCE. The 
+40 enhancer region has been shown to direct expression in transgenic mice to primitive, but not 
definitive erythoblasts, such as the phenotype displayed by K562. This example demonstrates that 
our methodology is capable of identifying regions of regulatory potential, which otherwise requires 
laborious effort to annotate. 

Local ARS profiles showed both differences and similarities across genes as well as cell-types. 
A few examples included: 

• CCR2 and CCR5 were significant for two different cell-types, HL60 and TH1, respectively 
(SI, Fig. S32). 

• Part of the HOX-cluster crucial for kidney development in mammals (HOXD8, HOXD4, and 
HOXD3) showed identical local ARS profiles (SI, Fig. S33), and all were significant genes in 
HRCE [9]. 

• Another example of shared profiles, but across several cell-types instead of across several 
genes, was seen with LOXL2, a gene essential for biogenesis of connective tissue, which is 
detected as an outlier in SKMC and has high relative ARS values in HAEpiC and BJ (SI, 
Fig. S34). Further fine-scale investigation showed a solid overlap in the local ARS profiles 
(Fig. S35). 

These observations point to a potentially widespread sharing of regulatory mechanisms both across 
genes and cell-types. 

3 Discussion 

As the epigenome in multicellular organisms is a dynamic entity whose variation leads to repro- 
gramming of gene expression [51], it is a likely candidate in the etiology of disease complementary 
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to that of mutations in DNA [39, 20]. It is therefore of considerable interest to identify and char- 
acterize the regulatory regions contributing to gene expression variation with respect to a given 
disease. 

We have presented a framework for quantifying relationships between chromatin structure and 
gene expression across multiple conditions (here, cell-types), facilitating a new avenue for under- 
standing cellular responses by localizing and characterizing regions of regulatory potential. The lo- 
cal ARS profiles we introduced allow specific hypersensitive regions to be associated with condition- 
specific gene expression, thereby conferring contextual regulatory information not obtainable using 
DHS data alone. This effectively pinpoints a shortlist of primary candidates for further functional 
studies. We found the peaks from the local ARS profiles in statistically significant segments to be 
both highly conserved and enriched for known transcription factor binding sites as far as 200kb 
from transcription start sites. While beyond the scope of the current work, we believe our approach 
could be used in conjunction with quantitative trait analyses (QTL) to increase the power for de- 
tecting true cis- and trans-acting SNP by interfering with transcription factor binding sites which 
in turn leads to altered DHS signals in a similar manner as Degner et al. [8]. 

As measurements from high throughput sequencing platforms become commonplace in molec- 
ular biology, there will be an increasing demand for the development of new statistical approaches 
for these data. A major challenge is that sequencing measurements are rarely in units directly 
relatable to one another; e.g., DHS measures chromatin accessibility, ChlP-seq measures binding 
affinity, RNA-seq measures RNA molecule abundance, etc. Our framework provides the initial 
development of a statistic which captures relationships among these measurements and enables 
statistical testing of associations among them. Moreover, by exploiting variation across multiple 
conditions, the sensitivity of our approach should only increase with additional data and sources of 
variation. Hence, the presented framework can likely be applied to test for associations between ap- 
propriate continuous quantitative genomic measurements and gene expression, thereby facilitating 
a comparable basis for meta-analyses on the interplay of epigenetic features. 

4 Materials and Methods 
4.1 DHS and gene expression data 

The data used in this study were generated through the ENCODE consortium and are publicly 
available. Established cell lines and primary cells used in this study were procured from commercial 
or other sources as listed in Table SI. The cells were cultured as per the vendor recommendations, 
and individual cell growth protocols are available in the UCSC human genome browser. The DHS 
data are available at the UCSC genome browser by downloading the track IDs listed in Table S2 
and the web address shown therein. Normalized probe-level expression data were obtained from 
the Gene Expression Omnibus (GEO); the accession numbers for all arrays are shown in Table S4. 
Probes were mapped to genes according to HG18 using bowtie [25] allowing for 2 mismatches and 
up to 10 maps to the genome, including the best match. Only probe sets for which all probes 
had a unique best match and fully corresponded to exon boundaries found in RefSeq annotations 
(HG18) were retained for further analysis. If a RefSeq gene had multiple splice variants, these were 
aggregated to a meta-gene structure. In the rare event that a gene mapped to currently ambiguous 
regions (e.g., chr6_random) such regions were not included. To arrive at a gene specific expression 
value, the mean expression across all probe sets within the exon boundaries of the gene model was 
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calculated. This yielded expression measures for 19,215 genes on 20 cell lines. 
4.2 Statistical methods 

The ARS algorithm and statistical analyses were written in the R programming language [33]. The 
main ARS algorithm, results, GO-analyses, and preprocessed data are available at 
http://encode.princeton.edu/. Complete details of the ARS algorithm, including the null ran- 
domization strategy and estimation of the angular penalty, are provided in SI. 

A schematic of the method is shown in Fig. 1. We represented the measurements of a single 
gene by two paired vectors x = (x\, . . . ,x m ) for gene expression and y = (yi, . . . ,y m ) for DHS 
volume, where m is the number of cell- types under consideration (here, m = 20). To place the two 

variables on a common scale, each vector was scaled by its maximum observation x s = j-^ r 

and y s = r-^ r so that all values are now in [0, 11. Each vector was then centered by its 

J max{j/i,...,j/ m } L J J 

median med(x s ) and med(y s ) to form x* = x s — med(x s ) and y* = y s — med(y s ). Hence the 
data for a given gene and segment are now centered around the two-dimensional medoid where the 
center of mass of the data lies at the origin. If there is little variation across the multiple cell-types, 
all points would cluster around the medoid, while singular cell-types displaying greater variation 
would be present as distinct outliers (SI, Fig. S9 and Fig. S10). To gauge potential outliers the 

Euclidian distance di = x* 2 + y* 2 were calculated for every cell type i = 1, . . . , m to produce the 

distance vector d = (d\, . . . ,d m ). We formed a ratio statistic according to ri = mc ^ d ) , thereby 
quantifying the relative distance of each point to the medoid. 

While the ratios describe the dispersion of the data, it does not account for any concordance 
between the measurements. A perfectly concordant relationship between the two measurements 
would result in points lying along the 45° diagonal identity line. We therefore calculated the angle 
9i for each data point (x*,y*) relative to the unit vector (1, 0) for i = 1, . . . , m, where < 9i < 360. 
The angular penalty involves first calculating the smaller of the two angular distances between 
6i and the identity line, denoted as Aj. For example, Aj = 1 4=5 — 0j| for < 6>j < 135. The 
angular penalty is calculated as cij = exp(c x Aj), where c is determined empirically to satisfy 
a correct null distribution (SI). Therefore, the value aj measures the penalized angular distance 
of (x*,y*) from the identity line in a symmetric fashion (SI, Fig. Sll). The statistic applied to 
each (x*,y*) pair is then ARSj = aj x rj, with the gene's overall statistic being the maximum, 
ARS m ax = max(ARSi, ARS2, • • ■ , ARS m ). In addition to calculating these quantities for each gene, 
we also recorded the ordering of the cell types as determined by their relative ARSj values. 

Inclusion of the angular penalty had a twofold purpose. Firstly, it correctly eliminated points 
that were outliers in only one dimension, gene expression or DHS alone, and therefore not of interest 
here since there is no direct relationship between the two measurements. Secondly, penalizing such 
points acted as a tuning parameter adjusting for the degree of off-diagonal noise in data, and thereby 
ensured a correct null distribution and p-values (Fig. S12). The specific value of c was determined 
such that observed null p-values over (p > 0.5) had a Uniform(0,l) distribution according to a 
Kolmogorov-Smirnov test (SI). Reassuringly, this lead to nearly identical values for c across all 
genomic segment sizes of DHS volume considered (SI, Fig. S13). 

The scaled data x* and y s for all genes were aggregated into a single distribution in the unit 
square [0,1] x [0,1]. From this, randomized data sets were created by sampling 20 points that 
preserves the fact that either one point must lie on (1, 1) or two points lie on (x, 1) and (l,y), re- 
spectively. The 20 sampled points are then median centered and the ARS max statistic is calculated. 
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We performed this 100 times to generate 100 sets of null ARS max statistics for every gene (for a 
total of 100 x 19,215 null statistics). A p- value was then formed for each gene by calculating the 
frequency by which null statistics exceed the observed statistic. The p- values were then utilized to 
calculate FDR q- values for the genes, as previously described [45]. See SI for full details on this 
randomization method. 

4.3 Selecting local ARS profile peaks for further analysis 

We first identified genes called significant at FDR < 0.10 for the ARS analysis performed on the 
segment size of ± 200kb about the TSS. We recorded the maximal cell-type for each of these 
genes (i.e., the cell type yielding the ARS max value), producing a list of significant gene/cell- type 
pairs. We limited our selection of gene/cell- type pairs to those cell types that were maximal at this 
threshold for at least 100 genes. For each of these selected gene/cell- type pairs, we scaled its local 
ARS profile by the maximal value in the ± 200kb segment about the TSS. All DNA sequences ± 
50bp with scaled local ARS profile value > 0.5 were then selected as "local ARS peaks." Likewise, 
all DNA sequences ± 50bp with scaled local ARS profile value < 0.2 were selected as the "negative 
control set." The local ARS peak set consisted of a total of 38,819 lOObp regions, and the negative 
control set consisted of 156,060 lOObp regions. 

4.4 TFBS analysis 

We took the above local ARS peaks and negative control set, and we eliminated all segments 
within ± lOkb of the TSS, reducing the number of local ARS peak segments from 38,819 to 
32,063 and negative control segments from 156,060 to 148,423. These were searched with all non- 
redundant vertebrate positions count matrices in the JASPAR database [31]. The position count 
matrices were converted to position weight matrices using a uniform background, and a matrix 
specific thresholding of 0.8 of the maximal log-odds score was used. Significant over- or under- 
representation was determined by exact binomial tests where the probability was based on the 
frequency of hits per base pair in the negative control sequences. Effect-size was calculated as log 2 
fold-change between number of hits per base pair in the local ARS peaks versus the negative control 
set. 

Web Resource 

To provide an interface for the community to utilize the results from this work, the local ARS 
tracks across any given gene in any of the 20 cell-types can be calculated via our web-service at 
http://encode.princeton.edu/, where all results encompassing the larger DHS regions are also 
searchable. 
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